Description

The data we’ll be using is a public dataset of ~8k PBMCs from 10x Genomics.

Sequencing data from scRNA-seq libraries is typically aligned to the transcriptome using the tool CellRanger (made by 10x Genomics). This process takes hours-to-days to run and requires access to a linux cluster, but fortunately, it’s fairly straightforward so we’ll just start working with the output generated by CellRanger. A few other tools have been developed to replace CellRanger, but for the most part, they just present improve run time rather than improved alignment/quantification.

DISCLAIMER: For this workshop, I will present a fairly standard workflow that is incredibly user-friendly. However, alternative tools exist for every step of the pipeline. Often, the tooling choice will largely just be a preference, but in some cases, these alternatives are demonstratably better. If you are considering getting more involved in the field, I encourage you to read the most recent benchmark papers and newest preprints–the field is still moving incredibly fast.

Load package dependencies

library(Seurat)
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.4
## ✓ tibble  3.0.1     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Load the data into R

Seurat has an easy function to load in data generated by CellRanger. It requires the presence of three files that are outputed: barcodes.tsv, features.tsv, matrix.mtx.

This funciton will read those files into a single sparse matrix variable in R with genes as rownames and cell barcodes at colnames.

pbmc.data <- Read10X(data.dir = "../data/filtered_gene_bc_matrices_8kpbmc/GRCh38/")

If working with public data, the names of files are often modified (eg. by NCBI’s GEO system) and you will have to switch them back.

Alternatively, many papers will simply provide this genes-by-cells matrix as a CSV file. If that is the case, you can use something the following code to read it into R: my_data <- read.csv(“~/Folder/location/of/their_data.csv”, row.names=1)

Move the data into a “Seurat Object”

If you look in your environment, you’ll see that the expression matrix has been loaded into R as the variable “pbmc.data”. This is great, but once you start doing more analysis, it can start getting messy if you start making dozens of other variables containing things like cell metadata, UMAP coordinates, etc.

This issue is fixed by a special class of variable called a Seurat object. A seurat object serves essentially as an organized container to hold almost all types of data you will work with in scRNA-seq analysis. The two benefits of this are 1) it simply keeps everything more organized than managing dozens of independent R variables, and 2) it imposes “rules” that help prevent you from messing your data up (for example, it demands that the cells in your metadata table are an exact match with the cells in the expression matrix).

Let’s store the expression matrix in a brand new Seurat object. I will generically name the seurat object “seurat” so that I can re-use code between experiments as much as possible

seurat <- CreateSeuratObject(counts = pbmc.data, 
                             min.cells = 3, 
                             project = "PBMC")
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
seurat
## An object of class Seurat 
## 18340 features across 8381 samples within 1 assay 
## Active assay: RNA (18340 features, 0 variable features)

Quality control & Filtering

Here, we will explore three main quality control metrics: 1) the percentage of transcripts per cell that align to the mitrochondrial genome 2) the number of genes with detectable transcripts in each cell 3) the number of unique transcriptions (based on UMIs) per cell

NOTE: The distribution of percent.mito is suuuuper sample-specific. You’ll have to change the cutoffs for your own sample

seurat[["percent.mito"]] <- PercentageFeatureSet(seurat, pattern = "^MT-")
VlnPlot(seurat, 
        features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), 
        pt.size = 0.25,
        ncol = 3)

Alternatively, you can grab these values out of the Seurat object and plot them yourself:

hist(seurat@meta.data$percent.mito, breaks=50)

You can also see that, consistent with the loss of cytoplasmic integrity, cells with high mitochondrial reads have low gene content:

qplot(x=nFeature_RNA, y=percent.mito, data=seurat@meta.data)

Filtering thresholds are quite arbitrary right now, but here, removing cells with >7.5% mitochrondrial reads seems good. It’s also common to remove cells with less than ~200 genes to remove “low information” cells. No cells in this dataset have fewer than that, but I’ll still include it in the code so you can see how it is done

seurat <- subset(seurat, 
               subset = nFeature_RNA > 200 & 
                 percent.mito < 7.5)

Normalization

Here, we’ll do a straightforward “naive” normalization approach: log-transformation and library size scaling

A few great alternatives have been propose (eg. SCTransform, which is also built into the Seurat package), but their use is a bit more nuanced, so I’ll avoid them for this workshop. I’d recommend checking them out though.

seurat <- NormalizeData(seurat)

Dimensionality Reduction

Usually from here, we’re interested in doing some exploratory analysis of our data. We want to make visualizations like tSNE/UMAP plots to see the “structure” of our data. We want to cluster cells based on their expression profiles. With that, we can start doing things like differential expression to identify which cell types are present in the data and which genes are good markers to distinguish different cellular populations.

All of this analysis requires us to perform “Dimensionality Reduction”. Currently in our data, each cell could be represented as a single point in 18,340-dimensional space (each dimension corresponding to a single gene). Obviously, this is impossible for us to manually explore unless we have prior knowledge about important dimensions/genes for our system. Luckily, gene expression is structured. If two genes are perfectly correlated, we could represent two genes worth of information with a single value. The goal of dimensionality reduction is to take advantage of this structure to “refine” 18,340 dimensions worth of information into a smaller number of dimensions.

Finding highly variable genes

Some of the genes in our data are like housekeeping genes and their expression is pretty constant across different cell types/states. As such, these genes don’t really contribute to the structure/heterogeneity of our data. One step we can take to refine our data down is to ignore these genes and only focus on genes that do contribute. To do this, we measure the variance of each gene’s expression across the entire population of cells. Genes with low variance are things like housekeeping genes and don’t tell us much. Cell type-specific markers are examples of genes that would have high variance because in some cells there is little-to-no expression, while in others they could have very high expression.

By focusing only on genes with high variance, we can retain much of the structure of our data in as few as a couple thousand genes.

Let’s find the top 2000 variable genes:

seurat <- FindVariableFeatures(seurat,
                               selection.method="vst",
                               nfeatures = 2000)

And just for our interest, let’s check out the top 20 variable genes

head(VariableFeatures(seurat), n=20)
##  [1] "IGLC3"      "IGLC2"      "PPBP"       "FCER1A"     "PF4"       
##  [6] "PTGDS"      "GNLY"       "GZMB"       "LILRA4"     "JCHAIN"    
## [11] "S100A8"     "SDPR"       "IGKC"       "AL928768.3" "S100A9"    
## [16] "HLA-DQA1"   "TUBB1"      "HLA-DRA"    "CD74"       "HLA-DPA1"

From this alone, we see many immune signatures we may expect, like key markers of MHC components, immunoglobulin components, etc.

Principal Component Analysis (PCA)

So 2000 dimensions is better than 18,000 but it’s still too large to explore. Luckily we have more tools to refine the structure down into fewer dimensions.

As I mentioned before, gene expression is quite modular due to co-regulation. PCA can identify this structure and generates principal components (PCs), which are single dimensions/axes/features that are a linear combination of gene expression measurements (see PPT slides). So instead of having a genes-by-cells matrix, we will embed our data in PC space and have PC-by-cell matrix.

Optional Demo: Correlation heatmap of variable genes

To demonstrate the structure of gene expression, we can simply calculate the pearson correlation of these top 2000 variable genes.

exp <- as.matrix(seurat[["RNA"]]@data[VariableFeatures(seurat),])
exp <- t(cor(t(exp), method="pearson"))
pheatmap::pheatmap(exp,
                   breaks = seq(-0.2, 0.2, length.out=101),
                   clustering_method = "ward.D2",
                   show_rownames = F,
                   show_colnames = F)

You’ll see that there are distinct modules of highly correlated genes. These often correspond to programs of distinct cell types.

PCA continued

Because PCA works by identifying sources of variation in the data, we have to standardize the means and variance of our data using a Z-score transformation. This is important so that every gene is considered equally and genes with a higher average expression don’t get weighed more in the PCA. Seurat has a convenient function for this:

seurat <- ScaleData(seurat)
## Centering and scaling data matrix

Side Note: Sometimes, there are technical “artifacts” in our embeddings that we don’t want. Eg. Sometimes on PCA, you’ll find that certain PCs are directly correlated with things like percent.mito (cell viability), cell cycle, etc. The effects of these variables can be “normalized” out through a linear regression model (we often call it “regressing it out”). I’m not going to add it here, but you could add the vars.to.regress option to ScaleData() to do this. eg. ScaleData(seurat, vars.to.regress=c(“percent.mito”, “S.Score”, "G2M.Score)).

Now we can run the PCA

seurat <- RunPCA(seurat)
## PC_ 1 
## Positive:  LTB, TRAC, CD3D, IL32, TRBC2, CD69, CD3G, CD7, CD27, IL7R 
##     IFITM1, RPS2, CD2, LEF1, CCR7, TRBC1, SPOCK2, NOSIP, CD247, CTSW 
##     TRAT1, GZMM, MAL, CD8B, CD8A, ITM2A, SYNE2, PIM2, AQP3, CDC25B 
## Negative:  CST3, LYZ, CSTA, MNDA, LST1, TYROBP, FCN1, FTL, AIF1, CTSS 
##     TYMP, FCER1G, S100A9, RP11-1143G9.4, LGALS1, FTH1, S100A8, LGALS2, SERPINA1, SPI1 
##     FGL2, S100A11, GRN, PSAP, CFD, GPX1, AP1S2, MS4A6A, CLEC7A, CD68 
## PC_ 2 
## Positive:  IL32, CD3D, TMSB4X, CTSW, CD7, TRAC, GZMA, NKG7, S100A4, CST7 
##     IFITM1, CCL5, GZMM, CD247, ANXA1, PRF1, CD2, KLRD1, CD3G, PFN1 
##     SRGN, S100A6, ITGB2, KLRB1, ID2, HOPX, IL7R, LYAR, KLRG1, GNLY 
## Negative:  CD79A, MS4A1, IGHM, IGHD, CD79B, TCL1A, IGKC, LINC00926, BANK1, CD74 
##     CD22, HLA-DQB1, HLA-DRA, HLA-DPA1, VPREB3, TNFRSF13C, HLA-DPB1, FCER2, FAM129C, MEF2C 
##     RALGPS2, HLA-DRB1, HLA-DQA1, HLA-DOB, FCRLA, SPIB, EAF2, HLA-DMB, HVCN1, TSPAN13 
## PC_ 3 
## Positive:  LEF1, CCR7, RPS2, MAL, S100A12, JUNB, TRAC, LTB, S100A8, VCAN 
##     S100A9, NOSIP, RGCC, RP11-1143G9.4, CD14, LRRN3, IL7R, FCN1, NELL2, RBP7 
##     FHIT, FOS, CXCL8, CSTA, CD27, TMSB10, AIF1, SOCS3, MNDA, CD3D 
## Negative:  GZMB, NKG7, PRF1, KLRD1, CST7, CLIC3, GZMA, FGFBP2, KLRF1, GNLY 
##     HOPX, SPON2, C12orf75, CD160, RHOC, CCL4, CCL5, CTSW, GZMH, XCL2 
##     FCGR3A, MATK, CMC1, TRDC, S1PR5, PLEK, ADGRG1, IL2RB, APOBEC3G, TBX21 
## PC_ 4 
## Positive:  LILRA4, LRRC26, SERPINF1, CLEC4C, IL3RA, SCT, TPM2, DERL3, PPP1R14B, PTCRA 
##     PLD4, LINC00996, TNFRSF21, ITM2C, DNASE1L3, SMPD3, LAMP5, MAP1A, GAS6, UGCG 
##     PTPRS, RP11-73G16.2, MYBL2, JCHAIN, ASIP, SCAMP5, FCER1A, CCDC50, IRF7, APP 
## Negative:  KLRD1, NKG7, FGFBP2, PRF1, CST7, KLRF1, GZMA, CD79B, MS4A1, GNLY 
##     HOPX, CD79A, IGHD, CCL4, CD160, FCGR3A, LINC00926, SPON2, XCL2, FCER2 
##     S1PR5, TRDC, GZMH, CD22, TTC38, MT-CO1, MATK, ADGRG1, TNFRSF13C, VPREB3 
## PC_ 5 
## Positive:  RPS2, TMSB10, LILRA4, LRRC26, SERPINF1, IL3RA, PPP1R14B, PLD4, CLEC4C, CCDC50 
##     SCT, DERL3, LINC00996, DNASE1L3, TPM2, IRF7, UGCG, TNFRSF21, LILRB4, LAMP5 
##     SMPD3, MYBL2, ASIP, SCAMP5, JCHAIN, CORO1A, MT-CO1, GAS6, ITM2C, RP11-117D22.2 
## Negative:  SDPR, PPBP, GNG11, TUBB1, PF4, CLU, SPARC, HIST1H2AC, ACRBP, GP9 
##     CTTN, ITGA2B, CMTM5, TMEM40, C2orf88, TSC22D1, HRAT92, MMD, MAP3K7CL, NRGN 
##     CLDN5, TREML1, HGD, HIST1H2BJ, GMPR, CA2, CD9, MYLK, SH3BGRL2, ESAM

By default, Seurat’s RunPCA() function will print out the genes with the top “loadings” for each PC.

Genes with the highest positive loading will be the “strongest forces” contributing to a cell’s position higher up on that PC. Genes with the the lowest (negative) loading do the opposite, lowering a cell’s value on that PC.

Let’s look into that a big more

Visualize the PCA

Seurat has a couple easy visualization functions. DimPlot() lets us look at our embeddings, colouring the points by any categorical info stored in the Seurat object (eg. cluster) and FeaturePlot() lets us look at our embeddings, colouring points by any quantitative value in our Seurat object (gene expression, quantitative metadata, etc).

DimPlot(seurat, reduction='pca')

And quickly, let’s look at a couple of those genes with high loadings with PC1/2. With FeaturePlot() you can give it as many genes as you want and it will generate multiple panels

FeaturePlot(seurat, features=c("CD3D", "LYZ", "NKG7", "CD79A"),
            cols = c('lightgrey', 'red'))

From this alone, we can see pretty distinct populations of monocytes (LYZ) and B cells (CD79A), but interestingly, we don’t get much separation of our T cells (CD3D) and NK cells (NKG7).

This is an important feature of PCA: it is a linear dimensionality reduction (feel free to read up on this more). For complex data, it will be impossible to capture all of the data’s structure in as few as two dimensions.

Let’s look at how much variation there is along each of the 50 PCs we computed.

ElbowPlot(seurat, ndims=50)

You can see that eventually the variance per PC plateaus. This is usually the point were the PCA is just capturing noise in the data. This means that ~20-30PCs have more variance above this noise, suggesting that these are probably the PCs that capture biological information.

So, we went from 18,000 dimensions, to 2000 dimensions, and now down to 20-30. We’re getting there, but unfortunately 30 dimensions are still hard to visualize.

UMAP: non-linear dimensionality reduction

You’ve probably heard of tSNE (t-distributed stochastic neighbor embedding) or UMAP (uniform manifold approximation and projection). These are both examples of non-linear dimensionality reduction methods. They are capable of embedding complex data in a very small number of dimensions. Both essentially try to optimize 2/3-dimensional organization of data where the distance between points in low dimension space is similar to distances in high dimensional space. As a result, similar data points in gene expression will be placed close to each other in the 2-3D embedding. In the end, the axes themselves mean nothing (eg. high x-axis values do not mean higher expression of certain genes). Similarly, because of how the algorithms are set up, things like the shape of the “blobs” on the plot or the relative positions of the “blobs” are also meaningless.

I’d recommend checking out this article on tSNE to see how different parameters and input data can affect the resultant tSNE embedding: https://distill.pub/2016/misread-tsne/

All this said, they are still incredibly effective tools for visualizing our data.

It’s common to use PCA embeddings as the input for the tSNE/UMAP algorithm. This has two main benefits: 1) compute time–these algorithms can be very slow, and even 2000 genes can take a while to run. 2) The idea of PCA was pull out the signal from our data in a relatively small number of PCs. By inputting only the “meaningful” PCs (ie. those above that background amount of noise we talked about above) into the UMAP/tSNE algorithm, we’re helping the signal-to-noise of the data for these tools to work with. From that “elbow plot” above, we saw that the first 20-30 PCs capture most of our biological variation. Let’s meet in the middle and use 25 PCs for the UMAP algorithm

seurat <- RunUMAP(seurat, dims=1:30)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 21:56:32 UMAP embedding parameters a = 0.9922 b = 1.112
## 21:56:32 Read 8337 rows and found 30 numeric columns
## 21:56:32 Using Annoy for neighbor search, n_neighbors = 30
## 21:56:32 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 21:56:33 Writing NN index file to temp file /var/folders/y9/j0gdhnsd1mn1lj2gsfqr532h0000gp/T//RtmpdCKen0/file39c45ab1c62
## 21:56:33 Searching Annoy index using 1 thread, search_k = 3000
## 21:56:36 Annoy recall = 100%
## 21:56:36 Commencing smooth kNN distance calibration using 1 thread
## 21:56:37 Initializing from normalized Laplacian + noise
## 21:56:37 Commencing optimization for 500 epochs, with 360540 positive edges
## 21:56:51 Optimization finished

Visualize the UMAP

DimPlot(seurat) #UMAP will have been made the default visualization for DimPlot by Seurat

Great! We’re now getting really good seperation of populations. Let’s look at those same markers.

FeaturePlot(seurat, features=c("CD3D", "LYZ", "NKG7", "CD79A"),
            cols = c('lightgrey', 'red'))

Notice that we now have the complete separation of a population of NKG7+ cells that were completely obscured in the PCA plots.

Save point

This is a good time to save a version of processed data. Many steps after this point can be modified, so it’s nice to have this version saved for you to come back to.

saveRDS(seurat, file="../output/pbmc_seurat_filtered.rds")